Abstract

Linear regression studies the problem of estimating a model parameter $\beta^* \in \mathbb{R}^p$ from $n$ observations $\{(y_i, x_i)\}_{i=1}^n$ from the linear model $y_i = \langle x_i, \beta^* \rangle + \epsilon_i$. We consider a significant generalization in which the relationship between $\langle x_i, \beta^* \rangle$ and $y_i$ is noisy, quantized to a single bit, potentially nonlinear, noninvertible, as well as unknown. This model is known as the single-index model in statistics, and, among other things, it represents a significant generalization of one-bit compressed sensing. We propose a novel spectral-based estimation procedure and show that we can recover $\beta^*$ in settings (i.e., classes of link function $f$) where previous algorithms fail. In general, our algorithm requires only very mild restrictions on the (unknown) functional relationship between $y_i$ and $\langle x_i, \beta^* \rangle$. We also consider the high dimensional setting where $\beta^*$ is sparse, and introduce a two-stage nonconvex framework that addresses estimation challenges in high dimensional regimes where $p \gg n$. For a broad class of link functions between $\langle x_i, \beta^* \rangle$ and $y_i$, we establish minimax lower bounds that demonstrate the optimality of our estimators in both the classical and high dimensional regimes.
1 Introduction


We consider a generalization of the one-bit quantized regression problem, where we seek to recover the regression coefficient $\beta^* \in \mathbb{R}^p$ from one-bit measurements. Specifically, suppose that $X$ is a random vector in $\mathbb{R}^p$ and $Y$ is a binary random variable taking values in $\{-1, 1\}$. We assume the conditional distribution of $Y$ given $X$ takes the form

$$\mathbb{P}(Y = 1 \mid X = x) = \frac{1}{2} f(\langle x, \beta^* \rangle) + \frac{1}{2}, \tag{1.1}$$

where $f : \mathbb{R} \to [-1, 1]$ is called the link function. We aim to estimate $\beta^*$ from $n$ i.i.d. observations $\{(y_i, x_i)\}_{i=1}^n$ of the pair $(Y, X)$. In particular, we assume the link function $f$ is unknown. Without any loss of generality, we take $\beta^*$ to be on the unit sphere $\mathbb{S}^{p-1}$ since its magnitude can always be incorporated into the link function $f$.
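To make the sampling model in (1.1) concrete, the following is a minimal Python sketch that draws one-bit responses for a given link function. The function name `sample_one_bit`, the choice of standard Gaussian covariates, and the `tanh` link in the demo are illustrative assumptions, not part of the model definition.

```python
import numpy as np

def sample_one_bit(n, beta_star, link, rng):
    """Draw n i.i.d. pairs (y_i, x_i) following (1.1).

    Assumption for illustration: covariates are standard Gaussian.
    `link` maps the reals into [-1, 1].
    """
    p = beta_star.shape[0]
    X = rng.standard_normal((n, p))
    # P(Y = 1 | X = x) = f(<x, beta*>) / 2 + 1 / 2
    prob_plus = 0.5 * link(X @ beta_star) + 0.5
    y = np.where(rng.random(n) < prob_plus, 1.0, -1.0)
    return y, X

rng = np.random.default_rng(0)
beta_star = np.zeros(5)
beta_star[0] = 1.0                      # a unit vector on the sphere
y, X = sample_one_bit(2000, beta_star, np.tanh, rng)
```

With `link=np.sign` the conditional probability is 0 or 1 almost surely, so the sampler reduces to the noiseless quantization $y = \mathrm{sign}(\langle x, \beta^* \rangle)$.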

The model in (1.1) is simple but general. Under specific choices of the link function $f$, (1.1) immediately leads to many practical models in machine learning and signal processing, including logistic regression and one-bit compressed sensing. In the settings where the link function is assumed to be known, a popular estimation procedure is to calculate an estimator that minimizes a certain loss function. However, for particular link functions, this approach involves minimizing a nonconvex objective function for which the global minimizer is in general intractable to obtain. Furthermore, it is difficult or even impossible to know the link function in practice, and a poor choice of link function may result in inaccurate parameter estimation and high prediction error. We take a more general approach, and in particular, target the setting where $f$ is unknown. We propose an algorithm that can estimate the parameter $\beta^*$ in the absence of prior knowledge on the link function $f$. As our results make precise, our algorithm succeeds as long as the function $f$ satisfies a single moment condition. As we demonstrate, this moment condition is only a mild restriction on $f$. In particular, our methods and theory are widely applicable even to the settings where $f$ is non-smooth, e.g., $f(z) = \mathrm{sign}(z)$, or noninvertible, e.g., $f(z) = \sin(z)$.

As we show in Section 2, our restrictions on $f$ are sufficiently flexible so that our results provide a unified framework that encompasses a broad range of problems, including logistic regression, one-bit compressed sensing, one-bit phase retrieval as well as their robust extensions. We use these important examples to illustrate our results, and discuss them at several points throughout the paper.

Main contributions. The key conceptual contribution of this work is a novel use of the method of moments. Rather than considering moments of the covariate, $X$, and the response variable, $Y$, we look at moments of differences of covariates, and differences of response variables. Such a simple yet critical observation enables everything that follows. In particular, it leads to our spectral-based procedure, which provides an effective and general solution for the suite of problems mentioned above. In the low dimensional (or what we refer to as the classical) setting, our algorithm is simple: a spectral decomposition of the moment matrix mentioned above. In the high dimensional setting, where the dimensionality $p$ far exceeds the number of samples $n$ and $\beta^*$ is sparse, we use a two-stage nonconvex optimization algorithm to perform the estimation.


We simultaneously establish the statistical and computational rates of convergence of the proposed spectral algorithm as well as its consequences for the aforementioned estimation problems. We consider both the low dimensional setting where the number of samples exceeds the dimension (we refer to this as the "classical" setting) and the high dimensional setting where the dimensionality may (greatly) exceed the number of samples. In both these settings, our proposed algorithm achieves the same statistical rate of convergence as that of linear regression applied on data generated by the linear model without quantization. Second, we provide minimax lower bounds for the statistical rate of convergence, and thereby establish the optimality of our procedure within a broad model class. In the low dimensional setting, our results obtain the optimal rate with the optimal sample complexity. In the high dimensional setting, our algorithm requires estimating a sparse eigenvector, and thus our sample complexity coincides with what is believed to be the best achievable via polynomial time methods (Berthet and Rigollet, 2013); the error rate itself, however, is information-theoretically optimal. We discuss this further in Section 3.4.

Related works. Our model in (1.1) is close to the single-index model (SIM) in statistics. In the SIM, we assume that the response-covariate pair $(Y, X)$ is determined by

$$Y = f(\langle X, \beta^* \rangle) + W \tag{1.2}$$

with unknown link function $f$ and noise $W$. Our setting is a special case of this, as we restrict $Y$ to be a binary random variable. The single index model is a classical topic, and the literature is too extensive to review exhaustively. We therefore outline the pieces of work most relevant to our setting and our results. For estimating $\beta^*$ in (1.2), a feasible approach is M-estimation (Härdle et al., 1993; Delecroix et al., 2000, 2006), in which the unknown link function $f$ is jointly estimated using nonparametric estimators. Although these M-estimators have been shown to be consistent, they are not computationally efficient since they involve solving a nonconvex optimization problem. Another approach to estimating $\beta^*$ is the average derivative estimator (ADE; Stoker (1986)). Further improvements of ADE are considered in Powell et al. (1989) and Hristache et al. (2001). ADE and its related methods require that the link function $f$ is at least differentiable, and thus exclude important models such as one-bit compressed sensing with $f(z) = \mathrm{sign}(z)$. Beyond estimating $\beta^*$, Kalai and Sastry (2009) and Kakade et al. (2011) focus on iteratively estimating a function $f$ and vector $\beta$ that are good for prediction, and they attempt to control the generalization error. Their algorithms are based on isotonic regression, and are therefore only applicable when the link function is monotonic and satisfies Lipschitz constraints. The work discussed above focuses on the low dimensional setting where $p \ll n$.

Another related line of work is sufficient dimension reduction, where the goal is to find a subspace $U$ of the input space such that the response $Y$ only depends on the projection $U^\top X$. The single-index model and our problem can be regarded as special cases of this problem, as we are primarily interested in recovering a one-dimensional subspace. Most works on this problem are based on spectral methods including sliced inverse regression (SIR; Li (1991)), sliced average variance estimation (SAVE; Cook and Lee (1999)) and principal Hessian directions (PHD; Li (1992); Cook (1998)). The key idea behind these algorithms is to construct certain empirical moments whose population level structures reveal the underlying true subspace. Our moment estimator is partially inspired by this idea. We highlight two differences compared to these existing works. First, our spectral method is based on computing the covariance matrix of response-weighted sample differences, which is not considered in previous works. This special design allows us to deal with both odd and even link functions under mild conditions, while SIR and PHD are each limited to only one of the two cases$^1$ and SAVE is less statistically efficient than our method. Second, all the aforementioned works focus on asymptotic analysis, while their performance (e.g., statistical rate) is much less understood under finite samples or in the high dimensional regime. In contrast, dealing with high dimensionality with optimal statistical rate is one of our main contributions.

In the high dimensional regime where $p \gg n$ and $\beta^*$ has some structure (for us this means sparsity), we note there exists some recent progress (Alquier and Biau, 2013) on estimating $f$ via PAC-Bayesian methods. In the special case when $f$ is a linear function, sparse linear regression has attracted extensive study over the years (see the book Bühlmann and van de Geer (2011) and references therein). The recent work by Plan et al. (2014) is closest to our setting. They consider the setting of normal covariates, $X \sim \mathcal{N}(0, I_p)$, and they propose a marginal regression estimator for estimating $\beta^*$ that, like our approach, requires no prior knowledge about $f$. Their proposed algorithm relies on the assumption that $\mathbb{E}_{z \sim \mathcal{N}(0,1)}[z f(z)] \neq 0$, and hence cannot work for link functions that are even. As we describe below, our algorithm is based on a novel moment-based estimator, and avoids requiring such a condition, thus allowing us to handle even link functions under a very mild moment restriction, which we describe in detail below. Generally, the work in Plan et al. (2014) requires different conditions, and thus beyond the discussion above, is not directly comparable to the work here. In cases where both approaches apply, the results are minimax optimal.


2 Example models 

In this section, we discuss several popular (and important) models in machine learning and signal 
processing that fall into our framework (1.1) under specific link functions. Variants of these models 
have been studied extensively in the recent literature. These examples trace through the paper, and 
we use them to illustrate the details of our algorithms and results. 


2.1 Logistic regression 


Given the response-covariate pair $(Y, X) \in \{-1, 1\} \times \mathbb{R}^p$ and model parameter $\beta^* \in \mathbb{R}^p$, for logistic regression we assume

$$\mathbb{P}(Y = 1 \mid X = x) = \frac{1}{1 + \exp(-\langle x, \beta^* \rangle - \zeta)}, \tag{2.1}$$

where $\zeta$ is the intercept. Compared with our general model (1.1), we have

$$f(z) = \frac{\exp(z + \zeta) - 1}{\exp(z + \zeta) + 1}.$$

$^1$ In our setting, SIR corresponds to approximating $\mathbb{E}(YX)$, which is 0 for even link functions; PHD corresponds to approximating $\mathbb{E}(YXX^\top)$, which is 0 for odd link functions.
One robust variant of logistic regression is called flipped logistic regression, where we assume that the labels $Y$ generated from (2.1) are flipped with probability $p_e$, i.e.,

$$\mathbb{P}(Y = 1 \mid X = x) = \frac{1 - p_e}{1 + \exp(-\langle x, \beta^* \rangle - \zeta)} + \frac{p_e}{1 + \exp(\langle x, \beta^* \rangle + \zeta)}. \tag{2.2}$$

This reduces to the standard logistic regression model when $p_e = 0$. For flipped logistic regression, the link function $f$ can be written as

$$f(z) = \frac{\exp(z + \zeta) - 1}{\exp(z + \zeta) + 1} + 2 p_e \cdot \frac{1 - \exp(z + \zeta)}{1 + \exp(z + \zeta)}. \tag{2.3}$$

Flipped logistic regression has been studied by Natarajan et al. (2013) and Tibshirani and Manning (2013). In both papers, estimating $\beta^*$ is based on minimizing some surrogate loss function involving a certain tuning parameter connected to $p_e$. However, $p_e$ is unknown in practice. In contrast to their approaches, our method does not hinge on the unknown parameter $p_e$. In fact, our approach has the same formulation for both standard and flipped logistic regression, and thus unifies the two models.
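As a short worked simplification (not stated in this form in the text), the two terms of the flipped logistic link share a common factor, so the link can be collapsed into a single rescaled hyperbolic tangent:

```latex
\begin{aligned}
f(z) &= \frac{e^{z+\zeta}-1}{e^{z+\zeta}+1} - 2p_e \cdot \frac{e^{z+\zeta}-1}{e^{z+\zeta}+1}
      = (1-2p_e)\cdot\frac{e^{z+\zeta}-1}{e^{z+\zeta}+1}
      = (1-2p_e)\,\tanh\!\Big(\frac{z+\zeta}{2}\Big).
\end{aligned}
```

Read this way, label flipping only rescales the standard logistic link by the factor $1 - 2p_e$, which vanishes at $p_e = 1/2$, where the labels carry no information about $\beta^*$.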


2.2 One-bit compressed sensing 

One-bit compressed sensing (e.g., Plan and Vershynin (2013a,b); Gopi et al. (2013)) aims at recovering sparse signals from quantized linear measurements. In detail, we define

$$\mathbb{B}_0(s, p) = \{\beta \in \mathbb{R}^p : |\mathrm{supp}(\beta)| \le s\} \tag{2.4}$$

as the set of sparse vectors in $\mathbb{R}^p$ with at most $s$ nonzero elements. We assume $(Y, X) \in \{-1, 1\} \times \mathbb{R}^p$ satisfies

$$Y = \mathrm{sign}(\langle X, \beta^* \rangle), \tag{2.5}$$

where $\beta^* \in \mathbb{B}_0(s, p)$. In this paper, we also consider its robust version with noise $\epsilon$, i.e.,

$$Y = \mathrm{sign}(\langle X, \beta^* \rangle + \epsilon). \tag{2.6}$$

Under our framework, the model in (2.5) corresponds to the link function $f(z) = \mathrm{sign}(z)$. Assuming $\epsilon \sim \mathcal{N}(0, \sigma^2)$ in (2.6), the model in (2.6) corresponds to the link function

$$f(z) = 2 \int_0^\infty \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(u - z)^2 / (2\sigma^2)} \, du - 1. \tag{2.7}$$

It is worth pointing out that (2.6) also corresponds to the probit regression model without the sparse constraint on $\beta^*$. Throughout the paper, we do not distinguish between the two model names. Model (2.6) is referred to as one-bit compressed sensing even in the case where $\beta^*$ is not sparse.
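The integral form of the robust one-bit compressed sensing link can be evaluated in closed form. The following worked step uses the substitution $v = (u - z)/\sigma$ and writes $\Phi$ for the standard normal distribution function; the notation $\Phi$ and $\operatorname{erf}$ is introduced here for illustration only.

```latex
\begin{aligned}
f(z) &= 2\int_0^\infty \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(u-z)^2/(2\sigma^2)}\,du - 1
      = 2\int_{-z/\sigma}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-v^2/2}\,dv - 1 \\
     &= 2\,\Phi(z/\sigma) - 1
      = \operatorname{erf}\!\Big(\frac{z}{\sqrt{2}\,\sigma}\Big).
\end{aligned}
```

In particular, for every fixed $z \neq 0$ this expression tends to $\mathrm{sign}(z)$ as $\sigma \to 0$, which matches the noiseless link function of one-bit compressed sensing.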
2.3 One-bit phase retrieval

The goal of phase retrieval (e.g., Candès et al. (2013); Chen et al. (2014); Candès et al. (2014)) is to recover signals based on linear measurements with phase information erased, i.e., the pair $(Y, X) \in \mathbb{R} \times \mathbb{R}^p$ is determined by the equation

$$Y = |\langle X, \beta^* \rangle|.$$

Analogous to one-bit compressed sensing, we consider a new model named one-bit phase retrieval where the linear measurement with phase information erased is quantized to one bit. In detail, the pair $(Y, X) \in \{-1, 1\} \times \mathbb{R}^p$ is linked through

$$Y = \mathrm{sign}(|\langle X, \beta^* \rangle| - \theta), \tag{2.8}$$

where $\theta$ is the quantization threshold. Compared with one-bit compressed sensing, this problem is more difficult because $Y$ only depends on $\beta^*$ through the magnitude of $\langle X, \beta^* \rangle$ instead of the value of $\langle X, \beta^* \rangle$. Also, it is more difficult than the original phase retrieval problem due to the additional quantization. Under our framework, the model in (2.8) corresponds to the link function

$$f(z) = \mathrm{sign}(|z| - \theta). \tag{2.9}$$

It is worth noting that, unlike previous models, here $f$ is neither odd nor monotonic. For simplicity, in this paper we assume the threshold $\theta$ is known.

3 Main results 

In this section, we present the proposed procedure for estimating $\beta^*$ and the corresponding main results, both for the classical, or low dimensional setting where $p \ll n$, as well as the high dimensional setting where we assume $\beta^*$ is sparse, and accordingly have $p \gg n$. We first introduce a second moment estimator based on pairwise differences. We prove that the eigenstructure of the constructed second moment estimator encodes the information of $\beta^*$. We then propose algorithms to estimate $\beta^*$ based upon this second moment estimator. In the high dimensional setting where $\beta^*$ is sparse, computing the top eigenvector of our pairwise-difference matrix reduces to computing a sparse eigenvector.

For both low dimensional and high dimensional settings, we prove bounds on the sample complexity and error rates achieved by our algorithm. We then derive the minimax lower bound for the estimation of $\beta^*$. In both cases, we show that our error rate is minimax optimal, thereby establishing the optimality of the proposed procedure for a broad model class. For the high dimensional setting, however, our rate of convergence is a local one, which means that it holds only after we have a point that is close to the optimal solution. We therefore also give a bound on the sample complexity required to find a point close enough; based on recent results on sparse PCA (Berthet and Rigollet, 2013), it is widely believed that this is the best possible for polynomial-time methods.
3.1 Conditions for success

We now introduce several key quantities, which allow us to state precisely the conditions required for the success of our algorithm.

Definition 3.1. For any (unknown) link function $f$, define the quantity $\phi(f)$ as follows:

$$\phi(f) := \mu_1^2 - \mu_0 \mu_2 + \mu_0^2, \tag{3.1}$$

where $\mu_0$, $\mu_1$ and $\mu_2$ are given by

$$\mu_k := \mathbb{E}[f(Z) Z^k], \quad k = 0, 1, 2, \ldots, \tag{3.2}$$

where $Z \sim \mathcal{N}(0, 1)$.

As we discuss in detail below, the key condition for success of our algorithm is $\phi(f) \neq 0$. As we show below, this is a relatively mild condition, and in particular, it is satisfied by the three examples introduced in Section 2. In fact, if $f$ is odd and monotonic (as in logistic regression and one-bit compressed sensing), by (3.2) it always holds that $\mu_0 = 0$, which further implies that $\phi(f) = \mu_1^2$. According to (3.2), in this case we have $\mu_1 = 0$ if and only if $f(z) = 0$ for all $z \in \mathbb{R}$. In other words, as long as $f$ is not identically zero, we have $\phi(f) > 0$. Of course, if $f(z) = 0$ for all $z$, no procedure can recover $\beta^*$ as $Y$ is independent of $X$. For one-bit phase retrieval, Lemma 3.4 shows that $\phi(f) > 0$ when the threshold $\theta$ in (2.8) satisfies $\theta > \theta_m$, where $\theta_m$ is the median of $|Z|$ with $Z \sim \mathcal{N}(0, 1)$, and $\phi(f) < 0$ when $\theta < \theta_m$. We note, in particular, that our condition $\phi(f) \neq 0$ does not preclude $f$ from being discontinuous, non-invertible, or even or odd.
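The quantity $\phi(f) = \mu_1^2 - \mu_0\mu_2 + \mu_0^2$ can be checked numerically for a given link function. The sketch below approximates the Gaussian moments $\mu_k = \mathbb{E}[f(Z)Z^k]$ with a plain Riemann sum on a fine grid; the helper names `gaussian_moments` and `phi`, the grid size, and the integration range are illustrative assumptions.

```python
import numpy as np

def gaussian_moments(link, grid_size=400_001, half_width=10.0):
    """Approximate mu_k = E[f(Z) Z^k], k = 0, 1, 2, for Z ~ N(0, 1)."""
    z = np.linspace(-half_width, half_width, grid_size)
    dz = z[1] - z[0]
    weights = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi) * dz
    fz = link(z)
    return tuple(float(np.sum(weights * fz * z**k)) for k in range(3))

def phi(link):
    """phi(f) = mu_1^2 - mu_0 * mu_2 + mu_0^2."""
    mu0, mu1, mu2 = gaussian_moments(link)
    return mu1**2 - mu0 * mu2 + mu0**2

# One-bit compressed sensing: f is odd, so phi(f) = mu_1^2 = (E|Z|)^2 = 2 / pi.
phi_sign = phi(np.sign)

# One-bit phase retrieval with thresholds on either side of the median of |Z|
# (the median is roughly 0.674).
phi_above = phi(lambda z: np.sign(np.abs(z) - 1.0))
phi_below = phi(lambda z: np.sign(np.abs(z) - 0.3))
```

Under these assumptions, `phi_above` comes out positive and `phi_below` negative, in line with the sign change at the median of $|Z|$ described for one-bit phase retrieval.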

3.2 Second moment estimator 

We describe a novel moment estimator that enables our algorithm. Let $\{(y_i, x_i)\}_{i=1}^n$ be the $n$ i.i.d. observations of $(Y, X)$. Assuming without loss of generality that $n$ is even, we consider the following key transformation

$$\Delta y_i := y_{2i} - y_{2i-1}, \quad \Delta x_i := x_{2i} - x_{2i-1}, \tag{3.3}$$

for $i = 1, 2, \ldots, n/2$. Our procedure is based on the following second moment

$$M := \frac{2}{n} \sum_{i=1}^{n/2} \Delta y_i^2 \, \Delta x_i \Delta x_i^\top \in \mathbb{R}^{p \times p}. \tag{3.4}$$

It is worth noting that constructing $M$ using the differences between all pairs of $x_i$ and $y_i$ instead of the consecutive pairs in (3.3) yields similar theoretical guarantees. However, this significantly increases the computational complexity for calculating $M$ when $n$ is large.
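The consecutive-pair construction can be written in a few lines of NumPy. This is a minimal sketch, assuming the responses are stored in a length-$n$ array `y` and the covariates in an $n \times p$ array `X`; the function name `second_moment` is hypothetical, and dropping the last sample when $n$ is odd is a convenience not discussed in the text.

```python
import numpy as np

def second_moment(y, X):
    """M = (2/n) * sum_i (dy_i)^2 * dx_i dx_i^T over consecutive pairs."""
    n = (len(y) // 2) * 2              # drop the last sample if n is odd
    dy = y[1:n:2] - y[0:n:2]           # y_{2i} - y_{2i-1} in 1-based indexing
    dX = X[1:n:2] - X[0:n:2]           # x_{2i} - x_{2i-1}
    # Weight each row of dX by dy_i^2, then form the p x p sum of outer products.
    return (2.0 / n) * (dX * (dy**2)[:, None]).T @ dX

# Tiny hand-checkable example with n = 4, p = 2.
y_demo = np.array([1.0, -1.0, 1.0, 1.0])
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.0]])
M_demo = second_moment(y_demo, X_demo)
```

The cost is $O(np^2)$, one pass over the $n/2$ pairs, whereas using all pairs would require $O(n^2 p^2)$ operations.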

The intuition behind this second moment is as follows. By (1.1), the variation of $X$ along the direction $\beta^*$ has the largest impact on the variation of $\langle X, \beta^* \rangle$. Thus, the variation of $Y$ directly depends on the variation of $X$ along $\beta^*$. Consequently, $\{(\Delta y_i, \Delta x_i)\}_{i=1}^{n/2}$ encodes the information of such a dependency relationship. In particular, $M$ defined in (3.4) can be viewed as the covariance matrix of $\{\Delta x_i\}_{i=1}^{n/2}$ weighted by $\{\Delta y_i^2\}_{i=1}^{n/2}$. Intuitively, the leading eigenvectors of $M$ correspond to the directions of maximum variation within $\{(\Delta y_i, \Delta x_i)\}_{i=1}^{n/2}$, which further reveals information on $\beta^*$. In the following, we make this intuition more rigorous by analyzing the eigenstructure of $\mathbb{E}(M)$ and its relationship with $\beta^*$.

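Lemma 3.2. For $\beta^* \in \mathbb{S}^{p-1}$, we assume that $(Y, X) \in \{-1, 1\} \times \mathbb{R}^p$ satisfies (1.1). For $X \sim \mathcal{N}(0, I_p)$, we have

$$\mathbb{E}(M) = 4\phi(f) \cdot \beta^* \beta^{*\top} + 4(1 - \mu_0^2) \cdot I_p, \tag{3.5}$$

where $\mu_0$ and $\phi(f)$ are defined in (3.2) and (3.1).

Proof. Let $X$ and $X'$ be two independent random vectors following $\mathcal{N}(0, I_p)$. Let $Y$ and $Y'$ be two binary responses that depend on $X$, $X'$ via (1.1). Then we have

$$\mathbb{E}(M) = \mathbb{E}\big[(Y - Y')^2 (X - X')(X - X')^\top\big].$$

Note that $(Y - Y')^2$ is a binary random variable taking values in $\{0, 4\}$. We have

$$\begin{aligned}
\mathbb{E}\big[(Y - Y')^2 \mid X = x, X' = x'\big] &= 4 \cdot \mathbb{P}\big[(Y - Y')^2 = 4 \mid X = x, X' = x'\big] \\
&= 4 \cdot \mathbb{P}(Y = 1 \mid X = x) \cdot \mathbb{P}(Y' = -1 \mid X' = x') + 4 \cdot \mathbb{P}(Y' = 1 \mid X' = x') \cdot \mathbb{P}(Y = -1 \mid X = x) \\
&= 2 - 2 f(\langle x, \beta^* \rangle) f(\langle x', \beta^* \rangle).
\end{aligned} \tag{3.6}$$

There exists some rotation matrix $Q \in \mathbb{R}^{p \times p}$ such that $Q\beta^* = e_1 := [1, 0, \ldots, 0]^\top$. Let $\tilde{X} := QX$ and $\tilde{X}' := QX'$. Then we have

$$\mathbb{E}\big[(Y - Y')^2 \mid X = x, X' = x'\big] = \mathbb{E}\big[(Y - Y')^2 \mid \tilde{X} = \tilde{x}, \tilde{X}' = \tilde{x}'\big] = 2 - 2 \cdot f(\tilde{x}_1) \cdot f(\tilde{x}'_1),$$

where $\tilde{x}_1$ and $\tilde{x}'_1$ denote the first entries of $\tilde{x} := Qx$ and $\tilde{x}' := Qx'$ respectively. Note that $\tilde{X}$ and $\tilde{X}'$ also follow $\mathcal{N}(0, I_p)$ since the symmetric Gaussian distribution is rotation invariant. Then we have

$$\begin{aligned}
\mathbb{E}(M) &= \mathbb{E}\big\{\big[2 - 2 f(\tilde{X}_1) f(\tilde{X}'_1)\big] (X - X')(X - X')^\top\big\} \\
&= Q^\top \mathbb{E}\big\{\big[2 - 2 f(\tilde{X}_1) f(\tilde{X}'_1)\big] (\tilde{X} - \tilde{X}')(\tilde{X} - \tilde{X}')^\top\big\} Q \\
&= 4 Q^\top \big[(\mu_1^2 - \mu_0 \mu_2 + \mu_0^2) \cdot e_1 e_1^\top + (1 - \mu_0^2) \cdot I_p\big] Q = 4\phi(f) \cdot \beta^* \beta^{*\top} + 4(1 - \mu_0^2) \cdot I_p.
\end{aligned}$$

The third equality is from the definitions of $\mu_0$, $\mu_1$, $\mu_2$ in (3.2) and the last equality is from (3.1). □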
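The third equality in the proof can be unpacked entry by entry. The following worked step is a sketch of that computation; the shorthands $D := \tilde{X} - \tilde{X}'$ and $g := f(\tilde{X}_1) f(\tilde{X}'_1)$ are introduced here for illustration only and use the independence of the coordinates of $\tilde{X}$ and $\tilde{X}'$.

```latex
\begin{aligned}
\mathbb{E}[D D^\top] &= 2 I_p, \\
\mathbb{E}[g D_1^2] &= \mathbb{E}\big[f(\tilde X_1) f(\tilde X'_1)\,(\tilde X_1^2 - 2\tilde X_1 \tilde X'_1 + \tilde X_1'^2)\big]
                     = 2\mu_0\mu_2 - 2\mu_1^2, \\
\mathbb{E}[g D_j^2] &= \mu_0^2 \cdot \mathbb{E}[D_j^2] = 2\mu_0^2 \quad (j \ge 2), \qquad
\mathbb{E}[g D_j D_k] = 0 \quad (j \ne k), \\
\mathbb{E}\big\{[2 - 2g]\, D D^\top\big\}
  &= 4 I_p - 2\big[2\mu_0^2 I_p + (2\mu_0\mu_2 - 2\mu_1^2 - 2\mu_0^2)\, e_1 e_1^\top\big] \\
  &= 4(1-\mu_0^2)\, I_p + 4(\mu_1^2 - \mu_0\mu_2 + \mu_0^2)\, e_1 e_1^\top .
\end{aligned}
```

Conjugating the last line by $Q$ and using $Q^\top e_1 = \beta^*$ gives the stated form of $\mathbb{E}(M)$.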

Lemma 3.2 proves that $\beta^*$ is the leading eigenvector of $\mathbb{E}(M)$ as long as the eigengap $\phi(f)$ is positive. If instead we have $\phi(f) < 0$, we can use a related moment estimator which has analogous properties. To this end, define:

$$M' := \frac{2}{n} \sum_{i=1}^{n/2} (y_{2i} + y_{2i-1})^2 \Delta x_i \Delta x_i^\top. \tag{3.7}$$

In parallel to Lemma 3.2, we have a similar result for $M'$ as stated below.
Corollary 3.3. For $\phi(f)$ defined in (3.1) and $M'$ defined in (3.7), we have

$$\mathbb{E}(M') = -4\phi(f) \cdot \beta^* \beta^{*\top} + 4(1 + \mu_0^2) \cdot I_p.$$

Proof. The proof is similar to that of Lemma 3.2. Using the same notation, we have

$$\mathbb{E}(M') = \mathbb{E}\big[(Y + Y')^2 (X - X')(X - X')^\top\big].$$

Note that

$$\begin{aligned}
\mathbb{E}\big[(Y + Y')^2 \mid X = x, X' = x'\big] &= 4 \cdot \mathbb{P}\big[(Y + Y')^2 = 4 \mid X = x, X' = x'\big] \\
&= 2 + 2 f(\langle x, \beta^* \rangle) f(\langle x', \beta^* \rangle).
\end{aligned}$$

Then following the same proof of Lemma 3.2, we reach the conclusion. □


Corollary 3.3 therefore shows that when $\phi(f) < 0$, we can construct another second moment estimator $M'$ such that $\beta^*$ is the leading eigenvector of $\mathbb{E}(M')$. As discussed above, this is precisely the setting for one-bit phase retrieval when the quantization threshold in (2.8) satisfies $\theta < \theta_m$. For simplicity of the discussion, hereafter we assume that $\phi(f) > 0$ and focus on the second moment estimator $M$ defined in (3.4).
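The role of the sum-based estimator can be illustrated on simulated one-bit phase retrieval data with a threshold below the median of $|Z|$. The sketch below is a small Monte Carlo check, not part of the analysis: the function name `second_moment_sum`, the dimensions, the threshold $\theta = 0.3$, and the random seed are all illustrative assumptions.

```python
import numpy as np

def second_moment_sum(y, X):
    """M' = (2/n) * sum_i (y_{2i} + y_{2i-1})^2 * dx_i dx_i^T."""
    n = (len(y) // 2) * 2
    sy = y[1:n:2] + y[0:n:2]           # pairwise sums of responses
    dX = X[1:n:2] - X[0:n:2]           # pairwise differences of covariates
    return (2.0 / n) * (dX * (sy**2)[:, None]).T @ dX

rng = np.random.default_rng(0)
p, n, theta = 5, 400_000, 0.3          # theta is below the median of |Z|
beta_star = np.ones(p) / np.sqrt(p)
X = rng.standard_normal((n, p))
y = np.sign(np.abs(X @ beta_star) - theta)

# eigh returns eigenvalues in ascending order, so the last column is the top eigenvector.
eigvals, eigvecs = np.linalg.eigh(second_moment_sum(y, X))
alignment = abs(eigvecs[:, -1] @ beta_star)
```

Because eigenvectors are only defined up to sign, the check uses the absolute inner product; with this many samples the top eigenvector of $M'$ should align closely with $\beta^*$.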

A natural question to ask is whether $\phi(f) \neq 0$ holds for specific models. The following lemma demonstrates exactly this, for the example models introduced in Section 2.

Lemma 3.4. For any $f : \mathbb{R} \to [-1, 1]$, recall $\phi(f)$ is defined in (3.1). Let $C$ be an absolute constant.

(a) For flipped logistic regression, the link function $f$ is defined in (2.3). By setting the intercept to be $\zeta = 0$, we have $\phi(f) \ge C(1 - 2p_e)^2$. Therefore, we obtain $\phi(f) > 0$ for $p_e \in [0, 1/2)$. In particular, we have $\phi(f) > 0$ for the standard logistic regression model in (2.1), since it corresponds to $p_e = 0$.

(b) For robust one-bit compressed sensing, $f$ is defined in (2.7). Recall that $\sigma^2$ denotes the variance of the noise term $\epsilon$ in (2.6). We have

$$\phi(f) \ge \begin{cases} C \left(\dfrac{1 - \sigma^2}{1 + \sigma^2}\right)^2, & \sigma^2 < \ldots, \\[2mm] \dfrac{C' \sigma^4}{(1 + \sigma^3)^2}, & \sigma^2 > \ldots \end{cases}$$

(c) For one-bit phase retrieval, $f$ is defined in (2.9). For $Z \sim \mathcal{N}(0, 1)$, we define $\theta_m$ to be the median of $|Z|$, i.e., $\mathbb{P}(|Z| \ge \theta_m) = 1/2$. We have $|\phi(f)| \ge C\theta|\theta - \theta_m| \exp(-\theta^2)$ and $\mathrm{sign}[\phi(f)] = \mathrm{sign}(\theta - \theta_m)$. Therefore, we obtain $\phi(f) > 0$ for $\theta > \theta_m$.

Proof. See §5.1 for a detailed proof. □


3.3 Low dimensional recovery 

We consider estimating $\beta^*$ in the classical (low dimensional) setting where $p \ll n$. Based on the second moment estimator $M$ defined in (3.4), estimating $\beta^*$ amounts to solving a noisy eigenvalue problem. We solve this by a simple iterative algorithm: provided an initial vector $\beta^0 \in \mathbb{S}^{p-1}$ (which may be chosen at random) we perform power iterations as shown in Algorithm 1. The performance of Algorithm 1 is characterized in the following result.
Algorithm 1 Low dimensional Recovery
Input $\{(y_i, x_i)\}_{i=1}^n$, number of iterations $T_{\max}$
1: Second moment estimation: $M \leftarrow 2/n \cdot \sum_{i=1}^{n/2} (y_{2i} - y_{2i-1})^2 (x_{2i} - x_{2i-1})(x_{2i} - x_{2i-1})^\top$
2: Initialization: Choose a random vector $\beta^0 \in \mathbb{S}^{p-1}$
3: For $t = 1, 2, \ldots, T_{\max}$ do
4:   $\beta^t \leftarrow M \cdot \beta^{t-1}$
5:   $\beta^t \leftarrow \beta^t / \|\beta^t\|_2$
6: end For
Output $\beta^{T_{\max}}$
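The power iteration on the pairwise-difference second moment can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not a reference one: the function name `low_dim_recovery`, the default number of iterations, and the simulated one-bit compressed sensing data are hypothetical choices.

```python
import numpy as np

def low_dim_recovery(y, X, T_max=50, rng=None):
    """Power iteration on the pairwise-difference second moment matrix."""
    rng = np.random.default_rng() if rng is None else rng
    n = (len(y) // 2) * 2
    dy = y[1:n:2] - y[0:n:2]
    dX = X[1:n:2] - X[0:n:2]
    M = (2.0 / n) * (dX * (dy**2)[:, None]).T @ dX   # second moment estimation
    beta = rng.standard_normal(X.shape[1])           # random initialization
    beta /= np.linalg.norm(beta)
    for _ in range(T_max):
        beta = M @ beta                              # power step
        beta /= np.linalg.norm(beta)                 # renormalize to the sphere
    return beta

# Simulated one-bit compressed sensing data, f(z) = sign(z).
rng = np.random.default_rng(0)
p, n = 10, 20_000
beta_star = rng.standard_normal(p)
beta_star /= np.linalg.norm(beta_star)
X = rng.standard_normal((n, p))
y = np.sign(X @ beta_star)

beta_hat = low_dim_recovery(y, X, rng=rng)
# M is quadratic in the data, so the direction is recovered only up to sign.
err = min(np.linalg.norm(beta_hat - beta_star), np.linalg.norm(beta_hat + beta_star))
```

Since the iteration only involves matrix-vector products with a $p \times p$ matrix, each step costs $O(p^2)$ after the one-time $O(np^2)$ construction of $M$.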


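Theorem 3.5. We assume $X \sim \mathcal{N}(0, I_p)$ and $(Y, X)$ follows (1.1). Let $\{(y_i, x_i)\}_{i=1}^n$ be $n$ i.i.d. samples of the response-input pair $(Y, X)$. Consider any link function $f$ in (1.1) with $\mu_0$, $\phi(f)$ defined in (3.2) and (3.1), and $\phi(f) > 0$.$^2$ We let

$$\gamma := \left[\frac{1 - \mu_0^2}{\phi(f) + 1 - \mu_0^2} + 1\right] \Big/ 2, \qquad \xi := \frac{\gamma \phi(f) + (\gamma - 1)(1 - \mu_0^2)}{(1 + \gamma)\,[\phi(f) + 1 - \mu_0^2]}. \tag{3.8}$$

There exist constants $C_i$ such that when $n \ge C_1 p / \xi^2$, we have that with probability at least $1 - 2\exp(-C_2 p)$,

$$\|\beta^t - \beta^*\|_2 \le \underbrace{C_3 \cdot \frac{\phi(f) + 1 - \mu_0^2}{\phi(f)} \cdot \sqrt{\frac{p}{n}}}_{\text{Statistical Error}} + \underbrace{\sqrt{\frac{1 - \alpha^2}{\alpha^2}} \cdot \gamma^t}_{\text{Optimization Error}}, \quad \text{for } t = 1, \ldots, T_{\max}. \tag{3.9}$$

Here $\alpha = \langle \beta^0, \hat{\beta} \rangle$, where $\hat{\beta}$ is the first leading eigenvector of $M$.

Proof. See §5.2 for a detailed proof. □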


Note that by (3.8) we have $\gamma \in (0, 1)$. Thus, the optimization error term in (3.9) decreases at a geometric rate to zero as $t$ increases. For $T_{\max}$ sufficiently large such that the statistical error and optimization error terms in (3.9) are of the same order, we have

$$\|\beta^{T_{\max}} - \beta^*\|_2 \lesssim \sqrt{p/n}.$$

This statistical rate of convergence matches the rate of estimating a $p$-dimensional vector in linear regression without any quantization, and will later be shown to be optimal. This result shows that the lack of prior knowledge on the link function and the information loss from quantization do not keep our procedure from obtaining the optimal statistical rate. The proof of Theorem 3.5 is based on a combination of the analysis for the power method under noisy perturbation and a concentration analysis for $M$. It is worth noting that the concentration analysis is close to but different from the one used in principal component analysis (PCA) since $M$ defined in (3.4) is not a sample covariance matrix.

Implications for example models. We now apply Theorem 3.5 to specific models defined in 
§2 and quantify the corresponding statistical rates of convergence. 

$^2$ Recall that we have an analogous treatment and thus results for $\phi(f) < 0$.
Corollary 3.6. Under the settings of Theorem 3.5, we have the following results for a sufficiently large $T_{\max}$.

• Flipped logistic regression: For any $p_e \in [0, 1/2)$, it holds that

$$\|\beta^{T_{\max}} - \beta^*\|_2 \lesssim \max\left\{1, \frac{1}{(1 - 2p_e)^2}\right\} \cdot \sqrt{\frac{p}{n}}.$$

For $p_e = 0$, it implies the result for the standard logistic regression model.

• Robust one-bit compressed sensing: For any $\sigma > 0$, it holds that

$$\|\beta^{T_{\max}} - \beta^*\|_2 \lesssim \max\{1, \sigma^2\} \cdot \sqrt{\frac{p}{n}}.$$

For $\sigma = 0$, it implies the result for standard one-bit compressed sensing.

• One-bit phase retrieval: For any threshold that satisfies $\theta > \theta_m$, where $\theta_m$ is a constant defined in Lemma 3.4, it holds that

$$\|\beta^{T_{\max}} - \beta^*\|_2 \lesssim \max\left\{1, \frac{1}{\theta(\theta - \theta_m)\exp(-\theta^2)}\right\} \cdot \sqrt{\frac{p}{n}}.$$

Proof. These results follow from combining Lemma 3.4 and Theorem 3.5. □

3.4 High dimensional recovery 

Next we consider the high dimensional setting where $p \gg n$ and $\beta^*$ is sparse, i.e., $\beta^* \in \mathbb{S}^{p-1} \cap \mathbb{B}_0(s, p)$ with $\mathbb{B}_0(s, p)$ defined in (2.4) and $s$ being the sparsity level. Although this high dimensional estimation problem is closely related to the well-studied sparse PCA problem, the existing works (Zou, 2006; Shen and Huang, 2008; d'Aspremont et al., 2008; Witten et al., 2009; Journée et al., 2010; Yuan and Zhang, 2013; Ma, 2013; Vu et al., 2013; Cai et al., 2013) on sparse PCA do not provide a direct solution to our problem. In particular, they either lack statistical guarantees on the convergence rate of the obtained estimator (Shen and Huang, 2008; d'Aspremont et al., 2008; Witten et al., 2009; Journée et al., 2010) or rely on the properties of the sample covariance matrix of Gaussian data (Cai et al., 2013; Ma, 2013), which are violated by the second moment estimator defined in (3.4). For the sample covariance matrix of sub-Gaussian data, Vu et al. (2013) prove that the convex relaxation proposed by d'Aspremont et al. (2007) achieves a suboptimal $s\sqrt{\log p / n}$ rate of convergence. Yuan and Zhang (2013) propose the truncated power method, and show that it attains the optimal $\sqrt{s \log p / n}$ rate locally; that is, it exhibits this rate of convergence only in a neighborhood of the true solution where $\langle \beta^0, \beta^* \rangle > C$ for some constant $C > 0$. It is well understood that for a random initialization on $\mathbb{S}^{p-1}$, such a condition fails with probability going to one as $p \to \infty$.

Instead, we propose a two-stage procedure for estimating $\beta^*$ in our setting. In the first stage, we adapt the convex relaxation proposed by Vu et al. (2013) and use it as an initialization step, in order to obtain a good enough initial point satisfying the condition $\langle \beta^0, \beta^* \rangle > C$. Then we adapt the truncated power method. This procedure is illustrated in Algorithm 2. The initialization phase of our algorithm requires $O(s^2 \log p)$ samples (see below for more precise details) to succeed. As work in Berthet and Rigollet (2013) suggests, it is unlikely that a polynomial time algorithm can avoid such dependence. However, once we are near the solution, as we show, this two-step procedure achieves the optimal error rate of $\sqrt{s \log p / n}$.

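Algorithm 2 Sparse Recovery
Input $\{(y_i, x_i)\}_{i=1}^n$, number of iterations $T_{\max}$, regularization parameter $\rho$, sparsity level $\hat{s}$.
1: Second moment estimation: $M \leftarrow 2/n \cdot \sum_{i=1}^{n/2} (y_{2i} - y_{2i-1})^2 (x_{2i} - x_{2i-1})(x_{2i} - x_{2i-1})^\top$
2: Initialization:
3:   $\Pi^0 \leftarrow \mathrm{argmin}_{\Pi \in \mathbb{R}^{p \times p}} \{-\langle M, \Pi \rangle + \rho \|\Pi\|_{1,1} \mid \mathrm{Tr}(\Pi) = 1,\ 0 \preceq \Pi \preceq I\}$
4:   $\bar{\beta}^0 \leftarrow$ first leading eigenvector of $\Pi^0$
5:   $\hat{S}^0 \leftarrow$ the set of index $j$'s with the top $\hat{s}$ largest $|\bar{\beta}^0_j|$'s
6:   For $j \in \{1, \ldots, p\}$ do
7:     $\beta^0_j \leftarrow \mathbb{1}\{j \in \hat{S}^0\} \cdot \bar{\beta}^0_j$
8:   end For
9:   $\beta^0 \leftarrow \beta^0 / \|\beta^0\|_2$
10: For $t = 1, 2, \ldots, T_{\max}$ do
11:   $\beta^t \leftarrow M \cdot \beta^{t-1}$
12:   $\hat{S}^t \leftarrow$ the set of index $j$'s with the top $\hat{s}$ largest $|\beta^t_j|$'s
13:   For $j \in \{1, \ldots, p\}$ do
14:     $\beta^t_j \leftarrow \mathbb{1}\{j \in \hat{S}^t\} \cdot \beta^t_j$
15:   end For
16:   $\beta^t \leftarrow \beta^t / \|\beta^t\|_2$
17: end For
Output $\beta^{T_{\max}}$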


We discuss the specific details of Algorithm 2. The initialization $\beta^0$ is obtained by solving the convex minimization problem in line 3 of Algorithm 2 and then conducting an eigenvalue decomposition. The convex minimization problem is a relaxation of the original sparse PCA problem, $\max_{\beta \in \mathbb{S}^{p-1} \cap \mathbb{B}_0(s, p)} \beta^\top M \beta$ (see d'Aspremont et al. (2007) for details). In line 3, $\rho > 0$ is the regularization parameter and $\|\cdot\|_{1,1}$ denotes the sum of the absolute values of all entries. The convex optimization problem in line 3 can be easily solved by the alternating direction method of multipliers (ADMM) algorithm (see Boyd et al. (2011); Vu et al. (2013) for details). Its minimizer is denoted by $\Pi^0 \in \mathbb{R}^{p \times p}$. In line 4, we set $\bar{\beta}^0$ to be the first leading eigenvector of $\Pi^0$, and further perform truncation (lines 5-8) and renormalization (line 9) steps to obtain the initialization $\beta^0$. After this, we iteratively perform power iteration (line 11 and line 16), together with a truncation step (lines 12-15) that enforces the sparsity of the eigenvector.
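The truncation, renormalization, and truncated power steps can be sketched as follows. This sketch covers only the second stage: it does not implement the convex relaxation of line 3, and instead takes an initial vector as an argument. In the demo, a noisy copy of $\beta^*$ stands in for the output of the first stage, which is purely an assumption for illustration, as are the function names `truncate` and `truncated_power`, the dimensions, and the choice of sparsity level.

```python
import numpy as np

def truncate(v, s_hat):
    """Keep the s_hat largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argpartition(np.abs(v), -s_hat)[-s_hat:]
    out[keep] = v[keep]
    return out

def truncated_power(M, beta0, s_hat, T_max=50):
    """Truncate and renormalize beta0, then run truncated power iterations."""
    beta = truncate(np.asarray(beta0, dtype=float), s_hat)
    beta = beta / np.linalg.norm(beta)
    for _ in range(T_max):
        beta = truncate(M @ beta, s_hat)      # power step followed by truncation
        beta = beta / np.linalg.norm(beta)    # renormalization
    return beta

# Simulated sparse one-bit compressed sensing data.
rng = np.random.default_rng(0)
p, n, s = 500, 4000, 5
beta_star = np.zeros(p)
beta_star[:s] = 1.0 / np.sqrt(s)
X = rng.standard_normal((n, p))
y = np.sign(X @ beta_star)
dy, dX = y[1::2] - y[0::2], X[1::2] - X[0::2]
M = (2.0 / n) * (dX * (dy**2)[:, None]).T @ dX

# Assumption: a perturbed copy of beta_star replaces the convex-relaxation stage.
beta0 = beta_star + 0.5 * rng.standard_normal(p) / np.sqrt(p)
beta_hat = truncated_power(M, beta0, s_hat=s)
err = min(np.linalg.norm(beta_hat - beta_star), np.linalg.norm(beta_hat + beta_star))
```

Using `np.argpartition` avoids a full sort when selecting the largest entries, which keeps each truncation step linear in $p$.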

The following theorem provides simultaneous statistical and computational characterizations of 
Algorithm 2. 
Theorem 3.7. Let

$$\kappa := \frac{4(1 - \mu_0^2) + \phi(f)}{4(1 - \mu_0^2) + 3\phi(f)} < 1 \tag{3.10}$$

and the minimum sample size be

$$n_{\min} := C \cdot s^2 \log p \cdot \phi(f)^2 \cdot \frac{\min\{\kappa(1 - \kappa^{1/2})/2,\ \kappa/8\}}{[(1 - \mu_0^2) + \phi(f)]^2}. \tag{3.11}$$

Suppose $\rho = C[\phi(f) + (1 - \mu_0^2)]\sqrt{\log p / n}$ with a sufficiently large constant $C$, where $\phi(f)$ and $\mu_0$ are specified in (3.1) and (3.2). Meanwhile, assume the sparsity parameter $\hat{s}$ in Algorithm 2 is set to be $\hat{s} = C'' \max\{[1/(\kappa^{-1/2} - 1)^2], 1\} \cdot s^*$. For $n \ge n_{\min}$ with $n_{\min}$ defined in (3.11), we have

$$\|\beta^t - \beta^*\|_2 \le \underbrace{C \cdot \frac{[\phi(f) + (1 - \mu_0^2)]^{5/2} (1 - \mu_0^2)^{1/2}}{\phi(f)^3} \cdot \sqrt{\frac{s \log p}{n}}}_{\text{Statistical Error}} + \underbrace{\kappa^t \cdot \sqrt{\min\{(1 - \kappa^{1/2})/2,\ 1/8\}}}_{\text{Optimization Error}} \tag{3.12}$$

with high probability. Here $\kappa$ is defined in (3.10).


The first term on the right-hand side of (3.12) is the statistical error while the second term gives
the optimization error. Note that the optimization error decays at a geometric rate since $\kappa < 1$.
For $T_{\max}$ sufficiently large, we have

$$\|\beta^{T_{\max}} - \beta^*\|_2 \lesssim \sqrt{s \log p / n}.$$

In the sequel, we show that the right-hand side gives the optimal statistical rate of convergence for
a broad model class under the high dimensional setting with $p \gg n$.
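
To make "sufficiently large" concrete, the following worked step gives one sufficient iteration count. It is a direct consequence of the geometric decay, assuming $n > s \log p$ and using only that the square-root factor multiplying $\kappa^t$ in the optimization error is at most one.

```latex
\kappa^{t} \cdot \sqrt{\min\{(1-\kappa^{1/2})/2,\ 1/8\}} \;\le\; \kappa^{t}
\;\le\; \sqrt{\frac{s \log p}{n}}
\quad\Longleftrightarrow\quad
t \;\ge\; \frac{\log\bigl(n/(s \log p)\bigr)}{2 \log(1/\kappa)} .
```

Hence a number of iterations that is logarithmic in $n/(s \log p)$ is enough for the optimization error to fall below the statistical error, up to constants.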


3.5 Minimax lower bound 

We establish the minimax lower bound for estimating $\beta^*$ in the model defined in (1.1). In the sequel
we define the family of link functions that are Lipschitz continuous and are bounded away from $\pm 1$.
Formally, for any $m \in (0,1)$ and $L > 0$, we define

$$\mathcal{F}(m, L) := \{f : |f(z)| \le 1 - m,\ |f(z) - f(z')| \le L|z - z'|,\ \text{for all } z, z' \in \mathbb{R}\}. \tag{3.13}$$

Let $\mathcal{X}_f^n := \{(y_i, x_i)\}_{i=1}^n$ be the $n$ i.i.d. realizations of $(Y, X)$, where $X$ follows $\mathcal{N}(0, I_p)$ and $Y$ satisfies
(1.1) with link function $f$. Correspondingly, we denote the estimator of $\beta^* \in \mathcal{B}$ by $\hat{\beta}(\mathcal{X}_f^n)$, where
$\mathcal{B}$ is the domain of $\beta^*$. We define the minimax risk for estimating $\beta^*$ as

$$\mathcal{R}(n, m, L, \mathcal{B}) := \inf_{f \in \mathcal{F}(m,L)}\ \inf_{\hat{\beta}(\mathcal{X}_f^n)}\ \sup_{\beta^* \in \mathcal{B}}\ \mathbb{E}\,\|\hat{\beta}(\mathcal{X}_f^n) - \beta^*\|_2. \tag{3.14}$$

In the above definition, we take the infimum not only over all possible estimators $\hat{\beta}$, but also over all
possible link functions in $\mathcal{F}(m, L)$. For a fixed $f$, our formulation recovers the standard definition of
minimax risk (Yu, 1997). By taking the infimum over all link functions, our formulation characterizes
the minimax lower bound under the least challenging $f$ in $\mathcal{F}(m,L)$. In the sequel we prove that
our procedure attains such a minimax lower bound for the least challenging $f$ given any unknown
link function in $\mathcal{F}(m,L) \cap \{f : \phi(f) > 0\}$. That is to say, even when $f$ is unknown, our estimation procedure is as
accurate as in the setting where we are provided the least challenging $f$, and the achieved accuracy
is not improvable due to the information-theoretic limit. The following theorem establishes the
minimax lower bound in the high dimensional setting. Recall that $\mathbb{B}_0(s,p)$ defined in (2.4) is the
set of $s$-sparse vectors in $\mathbb{R}^p$.

Theorem 3.8. Let $\mathcal{B} = \mathbb{S}^{p-1} \cap \mathbb{B}_0(s,p)$. We assume that $n > m(1-m)/(2L^2) \cdot [C s \log(p/s)/2 - \log 2]$.
For any $s \in (0, p/4]$, the minimax risk defined in (3.14) satisfies

$$\mathcal{R}(n, m, L, \mathcal{B}) \ge C' \cdot \frac{\sqrt{m(1-m)}}{L} \cdot \sqrt{\frac{s \log(p/s)}{n}}.$$

Here $C$ and $C'$ are absolute constants, while $m$ and $L$ are defined in (3.13).

Proof. See §5.4 for a detailed proof. □


Theorem 3.8 establishes the minimax optimality of the statistical rate attained by our procedure
for $p \gg n$ and $s$-sparse $\beta^*$. In particular, for arbitrary $f \in \mathcal{F}(m,L) \cap \{f : \phi(f) > 0\}$, the estimator
$\hat{\beta}$ attained by Algorithm 2 is minimax-optimal in the sense that its $\sqrt{s \log p/n}$ rate of convergence
is not improvable, even when the information on the link function $f$ is available. The next corollary
of Theorem 3.8 establishes the minimax lower bound for $p \ll n$.

Corollary 3.9. Let $\mathcal{B} = \mathbb{S}^{p-1}$. We suppose that $n > m(1-m)/(2L^2) \cdot (Cp - \log 2)$. The minimax
risk defined in (3.14) satisfies

$$\mathcal{R}(n, m, L, \mathcal{B}) \ge C' \cdot \frac{\sqrt{m(1-m)}}{L} \cdot \sqrt{\frac{p}{n}},$$

where $C$ and $C'$ are some absolute constants, while $m$ and $L$ are defined in (3.13).

Proof. The result follows from Theorem 3.8 by setting $s = p/4$. □
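
The substitution behind this one-line proof can be spelled out as follows. This is a sketch of the arithmetic only; it uses the fact that the supremum over the full sphere is at least the supremum over its sparse subset, and it lets the constants $C$ and $C'$ change from line to line.

```latex
\mathbb{S}^{p-1} \supseteq \mathbb{S}^{p-1} \cap \mathbb{B}_0(p/4, p)
\;\Longrightarrow\;
\mathcal{R}(n,m,L,\mathbb{S}^{p-1}) \ge \mathcal{R}\bigl(n,m,L,\mathbb{S}^{p-1} \cap \mathbb{B}_0(p/4,p)\bigr),
\\[4pt]
\sqrt{\frac{s \log(p/s)}{n}}\,\bigg|_{s = p/4}
= \sqrt{\frac{(p/4)\log 4}{n}}
= \frac{\sqrt{\log 4}}{2}\,\sqrt{\frac{p}{n}},
\qquad
\frac{C s \log(p/s)}{2} - \log 2\,\bigg|_{s = p/4}
= \frac{C \log 4}{8}\, p - \log 2 .
```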


It is worth noting that our lower bound becomes trivial for $m = 0$, i.e., when there exists some $z$ such
that $|f(z)| = 1$. One example is the noiseless one-bit compressed sensing model defined in (2.5),
for which we have $f(z) = \mathrm{sign}(z)$. In fact, for noiseless one-bit compressed sensing, the $\sqrt{s \log p/n}$
rate is not optimal. For example, Jacques et al. (2011) (Theorem 2) provide a computationally
inefficient algorithm that achieves rate $s \log p/n$. Understanding such a rate transition phenomenon
for link functions with zero margin, i.e., $m = 0$ in (3.13), is an interesting future direction.


4 Numerical results 

In this section, we provide numerical results to support our theory. We conduct two sets of
experiments. First, we examine the eigenstructures of the second moment estimators defined in
(3.4) and (3.7) for the following three models: flipped logistic regression (FLR) in (2.3), one-bit
compressed sensing with Gaussian noise $\mathcal{N}(0, \sigma^2)$ (one-bit CS) in (2.6) and one-bit phase retrieval
(one-bit PR) in (2.8). Second, for the same three models, we apply Algorithm 1 and Algorithm 2
to parameter estimation in the low dimensional and high dimensional regimes, respectively. Our
simulations are based on synthetic data. Given $n$, $p$, model parameter $\beta^*$, and some specific model,
we construct our data as follows. We first generate $n$ i.i.d. samples $x_1, \ldots, x_n$ from $\mathcal{N}(0, I_p)$. Then
for each sample $x_i$, we generate the corresponding label $y_i$ by plugging $\langle x_i, \beta^* \rangle$ into the specified
model. For the first set of experiments, we set $n = 3000$, $p = 20$. We randomly select $\beta^*$ from
$\mathbb{S}^{p-1}$. Figure 1 shows the top two eigenvalues of the second moment estimator constructed from $n$
samples. Each curve is an average of 10 independent trials. In the first two models (FLR and 1-bit
CS), as predicted by Lemmas 3.2 and 3.4, we observe that the gap between the first two eigenvalues,
corresponding to $\phi(f)$, decays with the noise parameters $p_e$ and $\sigma^2$. Note that the two models have
odd link functions, so we obtain $\mu_0 = 0$ and further $\mathbb{E}(M/4) = \phi(f) \cdot \beta^* \beta^{*\top} + I_p$. This
theoretical conclusion explains why the second eigenvalue in Panels 1(a)
and 1(b) stays close to 1 and does not change with the noise level. For 1-bit PR, when the quantization
threshold $\theta < \theta_m$, we have $\phi(f) < 0$. In this case, as claimed in Corollary 3.3, we
can construct a second moment estimator $M'$ whose expectation has top eigenvector $\beta^*$ and positive
eigengap $-\phi(f)$. Panel 1(c) shows the existence of a nontrivial eigengap of $M'$ in the region $\theta < \theta_m$
and thus validates our theory.

Figure 1: Eigenstructure of the second moment estimators defined in (3.4) and (3.7). Panels:
(a) Flipped Logistic Regression, (b) One-bit Compressed Sensing, (c) One-bit Phase Retrieval.
Panels (a) and (b) show the top two eigenvalues of $M/4$ for FLR and 1-bit CS respectively. Panel (c)
shows the top two eigenvalues of $M/4$ when $\theta > \theta_m$ and the first two eigenvalues of $M'/4$ when
$\theta < \theta_m$ for 1-bit PR.
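
The data-generation recipe and the eigenvalue computation for the one-bit CS case can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code. The label model $y = \mathrm{sign}(\langle x, \beta^* \rangle + \epsilon)$ is the form of (2.6) implied by the link function used in §5.1. The pairing of sample $i$ with sample $i + n/2$ is one possible reading of the estimator in (3.4), which §5.2 describes as the sample covariance of $n/2$ realizations of $(Y - Y')(X - X')$. All function names are hypothetical.

```python
import numpy as np


def sample_sphere(p, rng):
    """Draw a vector uniformly from the unit sphere in R^p."""
    v = rng.standard_normal(p)
    return v / np.linalg.norm(v)


def one_bit_cs_labels(X, beta, sigma2, rng):
    """Assumed form of the noisy one-bit model: y = sign(<x, beta> + eps)."""
    eps = np.sqrt(sigma2) * rng.standard_normal(X.shape[0])
    return np.where(X @ beta + eps > 0, 1.0, -1.0)


def second_moment(X, y):
    """Average of (y - y')^2 (x - x')(x - x')^T over n/2 disjoint sample pairs."""
    half = X.shape[0] // 2
    dx = X[:half] - X[half:2 * half]
    dy = y[:half] - y[half:2 * half]
    return (dx * (dy ** 2)[:, None]).T @ dx / half


rng = np.random.default_rng(0)
n, p, sigma2 = 3000, 20, 0.1
beta_star = sample_sphere(p, rng)
X = rng.standard_normal((n, p))
y = one_bit_cs_labels(X, beta_star, sigma2, rng)

M = second_moment(X, y)
eigvals, eigvecs = np.linalg.eigh(M / 4.0)   # eigenvalues in ascending order
top_two = eigvals[::-1][:2]                  # the quantities plotted in Figure 1
beta_hat = eigvecs[:, -1]                    # leading eigenvector
alignment = abs(beta_hat @ beta_star)        # sign of an eigenvector is arbitrary
```

A full eigendecomposition is used here only for brevity; a power method on `M` would return the same leading eigenvector.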

In the second set of experiments, we fix $p_e = 0.1$, $\sigma^2 = 0.1$, $\theta = 1$ for the three models. For
low dimensional recovery, we randomly select $\beta^*$ from $\mathbb{S}^{p-1}$. For high dimensional recovery, we
generate $\beta^*$ as follows. Given sparsity $s$, we first randomly select a subset $\mathcal{S}$ of $\{1, \ldots, p\}$ with
size $s$ as the support of $\beta^*$. We then set $\beta^*_{\mathcal{S}}$ to be a vector that is randomly generated from $\mathbb{S}^{s-1}$. We
characterize the estimation error by the $\ell_2$ norm. Figure 2 plots the estimation error against the quantity
$\sqrt{p/n}$ in the low dimensional regime. Each curve is an average of 100 independent trials. We note
that for the same value of $\sqrt{p/n}$, we obtain almost the same estimation error in practice. Moreover,
we observe that the estimation error has a linear dependence on $\sqrt{p/n}$. These two empirical results
correspond to our theoretical conclusion in Theorem 3.5. Figure 3 plots the estimation error against
$\sqrt{s \log p/n}$ for recovering $s$-sparse $\beta^*$ with different values of $s$ and $p$. Each curve is an average
of 100 independent trials. Similar to low dimensional recovery, we observe that the estimation
error is nearly proportional to $\sqrt{s \log p/n}$ and the same $\sqrt{s \log p/n}$ leads to approximately identical
estimation error. This phenomenon validates Theorem 3.7.

Figure 2: Estimation error of low dimensional recovery in three models, plotted against $\sqrt{p/n}$.
(a) For FLR, we set flipping probability $p_e = 0.1$. (b) For 1-bit CS, we set variance of Gaussian
noise $\sigma^2 = 0.1$. (c) For 1-bit PR, we set quantization threshold $\theta = 1$.

Figure 3: Estimation error of sparse recovery in three models, plotted against $\sqrt{s \log p/n}$.
(a) For FLR, we set flipping probability $p_e = 0.1$. (b) For 1-bit CS, we set variance of Gaussian
noise $\sigma^2 = 0.1$. (c) For 1-bit PR, we set quantization threshold $\theta = 1$.
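
The construction of the sparse parameter and the error metric can be sketched as follows. The helper names are hypothetical. Taking the minimum over $\pm\hat{\beta}$ is an assumption of this sketch, motivated by the sign ambiguity of an eigenvector; it is not a convention stated in the text.

```python
import numpy as np


def sparse_unit_vector(p, s, rng):
    """Random s-sparse unit vector: uniform support, uniform direction on it."""
    support = rng.choice(p, size=s, replace=False)
    beta = np.zeros(p)
    g = rng.standard_normal(s)
    beta[support] = g / np.linalg.norm(g)
    return beta


def l2_error(beta_hat, beta_star):
    """l2 distance after resolving the sign ambiguity (assumed convention)."""
    return min(np.linalg.norm(beta_hat - beta_star),
               np.linalg.norm(beta_hat + beta_star))


def sparse_rate(s, p, n):
    """The x-axis quantity of Figure 3, sqrt(s log p / n)."""
    return np.sqrt(s * np.log(p) / n)
```

Plotting `l2_error` against `sparse_rate` for several `(s, p, n)` triples is the kind of comparison that Figure 3 reports.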


5 Proofs 

In this section, we provide the proofs for our main results. First we characterize the implications 
of our general framework for the models in Section 2. We then establish the statistical convergence 
rates of the proposed procedure and the corresponding minimax lower bounds.

5.1 Proof of Lemma 3.4

Flipped logistic regression. For flipped logistic regression, the link function $f$ is defined in (2.3),
where $\zeta$ is the intercept. For $\zeta = 0$, we have

$$f(z) = \frac{e^z - 1}{e^z + 1} + 2p_e \cdot \frac{1 - e^z}{1 + e^z}.$$

Note that $f$ is odd. Hence, by (3.2) we have $\mu_0 = \mu_2 = 0$. Meanwhile, from Stein's lemma, we have

$$\mu_1 = \mathbb{E}[f'(z)] = \mathbb{E}\left[(1 - 2p_e)\,\frac{2e^z}{(1 + e^z)^2}\right] = (1 - 2p_e) \cdot \mathbb{E}\left[\frac{2e^z}{(1 + e^z)^2}\right].$$

We thus have $\phi(f) = \mu_1^2 \ge C(1 - 2p_e)^2$ for some constant $C$.
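
The use of Stein's lemma can be checked numerically. The sketch below approximates $\mathbb{E}[f(z)z]$ and $\mathbb{E}[f'(z)]$ for the zero-intercept link $f(z) = (1 - 2p_e)(e^z - 1)/(e^z + 1)$ with a trapezoid rule and confirms that the two agree and that $\mu_0$ vanishes. The integration range, the step count, and the function names are choices made for this illustration.

```python
import math


def gaussian_expectation(g, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoid-rule approximation of E[g(z)] for z ~ N(0, 1)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        z = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * g(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)


def flr_link(z, p_e):
    """Zero-intercept link: (1 - 2 p_e)(e^z - 1)/(e^z + 1) = (1 - 2 p_e) tanh(z/2)."""
    return (1.0 - 2.0 * p_e) * math.tanh(0.5 * z)


def flr_link_derivative(z, p_e):
    """Derivative of the link: (1 - 2 p_e) * 2 e^z / (1 + e^z)^2."""
    return (1.0 - 2.0 * p_e) * 2.0 * math.exp(z) / (1.0 + math.exp(z)) ** 2


p_e = 0.1
mu0 = gaussian_expectation(lambda z: flr_link(z, p_e))
mu1_direct = gaussian_expectation(lambda z: flr_link(z, p_e) * z)
mu1_stein = gaussian_expectation(lambda z: flr_link_derivative(z, p_e))
```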

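Robust one-bit compressed sensing. Recall in robust one-bit compressed sensing, we have

$$f(z) = 2 \cdot \mathbb{P}(z + \epsilon > 0) - 1,$$

where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is the noise term in (2.6). In particular, note that

$$f(z) + f(-z) = 2 \cdot [\mathbb{P}(\epsilon > z) + \mathbb{P}(\epsilon > -z)] - 2 = 0.$$

Hence, $f(z)$ is an odd function, which implies $\mu_0 = \mu_2 = 0$ by (3.2). For $\mu_1$ defined in (3.2), we
have

$$\begin{aligned}
\mu_1 = \mathbb{E}[f(z)z] &= \mathbb{E}\{[2 \cdot \mathbb{P}(\epsilon > -z) - 1]z\} = \mathbb{E}[\mathbb{P}(|\epsilon| < |z|)\,|z|] \ge \mathbb{E}\{[1 - 2e^{-z^2/(2\sigma^2)}]\,|z|\} \\
&= \mathbb{E}(|z|) - 2\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, e^{-u^2/(2\sigma^2)}\, |u|\, du = \mathbb{E}(|z|)\left(1 - \frac{2\sigma^2}{1 + \sigma^2}\right) = \mathbb{E}(|z|) \cdot \frac{1 - \sigma^2}{1 + \sigma^2}.
\end{aligned} \tag{5.1}$$

Here the inequality is from the fact that $\mathbb{P}(|\epsilon| < |z|) \ge 1 - 2e^{-z^2/(2\sigma^2)}$ since $\epsilon \sim \mathcal{N}(0, \sigma^2)$. For $\sigma^2 \le 1/2$,
we have

$$\phi(f) = \mu_1^2 \ge C \left(\frac{1 - \sigma^2}{1 + \sigma^2}\right)^2,$$

where $C = [\mathbb{E}(|z|)]^2$ with $z \sim \mathcal{N}(0,1)$. For $\sigma^2 > 1/2$, rather than applying $\mathbb{P}(|\epsilon| < |z|) \ge 1 - 2e^{-z^2/(2\sigma^2)}$
in the inequality of (5.1), we apply $\mathbb{P}(|\epsilon| < |z|) \ge \sqrt{2/\pi}\,\sigma^{-1} e^{-z^2/(2\sigma^2)} |z|$ since $\epsilon \sim \mathcal{N}(0, \sigma^2)$. We then obtain

$$\mu_1 \ge \mathbb{E}\left[\sqrt{\frac{2}{\pi}}\,\frac{1}{\sigma}\, e^{-z^2/(2\sigma^2)} z^2\right] = \sqrt{\frac{2}{\pi}}\,\frac{1}{\sigma} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/(2\sigma^2)}\, e^{-u^2/2}\, u^2\, du \ge \frac{C'}{\sigma} \left(\frac{\sigma^2}{1 + \sigma^2}\right)^{3/2}.$$

Finally, for $\sigma^2 > 1/2$ we have

$$\phi(f) = \mu_1^2 \ge \frac{(C')^2 \sigma^4}{(1 + \sigma^2)^3}.$$

One-bit phase retrieval. For the one-bit phase retrieval model, the major difference from the
previous two models is that $f(z)$ is even, which results in $\mu_1 = 0$. By the definition in (3.2), we
have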


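$$\mu_0 = \mathbb{E}[f(z)] = \mathbb{P}(|z| > \theta) - \mathbb{P}(|z| < \theta),$$

and

$$\mu_2 = \mathbb{E}[f(z)z^2] = \mathbb{P}(|z| > \theta)\,\mathbb{E}(z^2 \mid |z| > \theta) - \mathbb{P}(|z| < \theta)\,\mathbb{E}(z^2 \mid |z| < \theta).$$

For notational simplicity, we define $p_1 = \mathbb{P}(|z| > \theta)$. We have

$$\phi(f) = \mu_0(\mu_0 - \mu_2) = 2p_1(2p_1 - 1)\,[1 - \mathbb{E}(z^2 \mid |z| > \theta)], \tag{5.2}$$

where the second equality follows from the fact that

$$\mathbb{P}(|z| > \theta) + \mathbb{P}(|z| < \theta) = 1, \tag{5.3}$$

and

$$\mathbb{P}(|z| > \theta)\,\mathbb{E}(z^2 \mid |z| > \theta) + \mathbb{P}(|z| < \theta)\,\mathbb{E}(z^2 \mid |z| < \theta) = \mathbb{E}(z^2) = 1. \tag{5.4}$$

By (5.3) and (5.4) we have $p_1 > 0$ and $\mathbb{E}(z^2 \mid |z| > \theta) > 1$ for $\theta > 0$. Hence, for $\theta < \theta_m$ with $\theta_m$
being the median of $|z|$ with $z \sim \mathcal{N}(0,1)$, we have $p_1 > 1/2$, which further implies $\phi(f) < 0$ by
(5.2). Otherwise we have $\phi(f) > 0$. Thus, we have $\mathrm{sign}[\phi(f)] = \mathrm{sign}(\theta - \theta_m)$.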
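
The sign rule can be checked numerically. The sketch below assumes the even link $f(z) = \mathrm{sign}(|z| - \theta)$, which is the form implied by the expression for $\mu_0$ above, and evaluates $\mu_0$ and $\mu_2$ in closed form. The identity $\mathbb{E}[z^2 \mathbf{1}\{|z| > \theta\}] = 2\theta\varphi(\theta) + \mathbb{P}(|z| > \theta)$, with $\varphi$ the standard normal density, follows from integration by parts. The function and constant names are hypothetical.

```python
import math


def phase_retrieval_phi(theta):
    """phi(f) = mu0 * (mu0 - mu2) for the even link f(z) = sign(|z| - theta)."""
    pdf = math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
    p1 = math.erfc(theta / math.sqrt(2.0))        # P(|z| > theta)
    tail_second_moment = 2.0 * theta * pdf + p1   # E[z^2 1{|z| > theta}]
    mu0 = 2.0 * p1 - 1.0
    mu2 = 2.0 * tail_second_moment - 1.0
    return mu0 * (mu0 - mu2)


THETA_M = 0.6744897501960817  # median of |z| for z ~ N(0, 1)
```

Evaluating `phase_retrieval_phi` on either side of `THETA_M` shows the sign change. The threshold $\theta = 1$ used in the experiments lies on the positive side.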

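In the following we establish a lower bound for $|\phi(f)|$. Note that

$$\mathbb{E}(z^2 \mid |z| > \theta) = \frac{2}{p_1} \int_{\theta}^{+\infty} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} z^2\, dz = \frac{2\theta}{p_1\sqrt{2\pi}}\, e^{-\theta^2/2} + 1. \tag{5.5}$$

Plugging (5.5) into (5.2) yields

$$\phi(f) = -2(2p_1 - 1)\,\frac{2\theta}{\sqrt{2\pi}}\, e^{-\theta^2/2}. \tag{5.6}$$

For $0 < \theta < \theta_m$, which implies $p_1 > 1/2$, we have

$$p_1 - \frac{1}{2} = 2\int_{\theta}^{\theta_m} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz \ge \frac{2}{\sqrt{2\pi}}\, e^{-\theta_m^2/2}(\theta_m - \theta). \tag{5.7}$$

By plugging (5.7) into (5.6), we have

$$|\phi(f)| \ge \frac{8}{\sqrt{2\pi}}\, e^{-\theta_m^2/2}(\theta_m - \theta) \cdot \frac{2\theta}{\sqrt{2\pi}}\, e^{-\theta^2/2} \ge C\theta(\theta_m - \theta)\, e^{-\theta_m^2}. \tag{5.8}$$

For $\theta > \theta_m$, which implies $p_1 < 1/2$, similarly to (5.7), we have

$$\frac{1}{2} - p_1 = 2\int_{\theta_m}^{\theta} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx \ge \frac{2}{\sqrt{2\pi}}\, e^{-\theta^2/2}(\theta - \theta_m). \tag{5.9}$$

Thus, we conclude that

$$|\phi(f)| \ge C'\theta(\theta - \theta_m)\, e^{-\theta^2}.$$

5.2 Proof of Theorem 3.5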


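Let $\hat{\beta}$ be the top eigenvector of $M$ and $\hat{\lambda}_1, \hat{\lambda}_2$ be the first and second largest eigenvalues of $M$. We
use $\lambda_1, \lambda_2$ to denote the first and second largest eigenvalues of $\mathbb{E}(M)$. From Lemma 3.2, we already
know that

$$\lambda_1 = 4\phi(f) + 4(1 - \mu_0^2), \quad \text{and} \quad \lambda_2 = 4(1 - \mu_0^2).$$

By the triangle inequality, we have

$$\|\beta^t - \beta^*\|_2 \le \|\hat{\beta} - \beta^*\|_2 + \|\beta^t - \hat{\beta}\|_2.$$

The first term on the right hand side is the statistical error and the second term is the optimization
error. From standard analysis of the power method, we have

$$\|\beta^t - \hat{\beta}\|_2 \le \sqrt{\frac{1 - \alpha^2}{\alpha^2}} \cdot \left(\frac{\hat{\lambda}_2}{\hat{\lambda}_1}\right)^t,$$

where $\alpha = \langle \beta^0, \hat{\beta} \rangle$. By the definition in (3.4), $M$ is the sample covariance matrix of $n/2$ independent
realizations of the random vector $(Y - Y')(X - X') \in \mathbb{R}^p$. Since $X$ is Gaussian and $Y$ is bounded,
$(Y - Y')(X - X')$ is sub-Gaussian. By standard concentration results (see e.g. Theorem 5.39 in
Vershynin (2010)), there exist some constants $C, C_1$ such that for any $t \ge 0$, with probability at least
$1 - 2e^{-Ct^2}$,

$$\|M - \mathbb{E}(M)\|_2 \le \max(\delta, \delta^2)\,\|\mathbb{E}(M)\|_2,$$

where $\delta = C_1\sqrt{p/n} + t/\sqrt{n}$. We let $t = \sqrt{p}$; then for any $\xi \in (0,1)$, we have that $\|M - \mathbb{E}(M)\|_2 \le \xi\|\mathbb{E}(M)\|_2$
when $n \ge C_2 p/\xi^2$ for a sufficiently large constant $C_2$. Conditioning on $\|M - \mathbb{E}(M)\|_2 \le \xi\|\mathbb{E}(M)\|_2$,
from Weyl's inequality, we have

$$\hat{\lambda}_1 \ge 4(1 - \xi)\,[\phi(f) + 1 - \mu_0^2], \quad \text{and} \quad \hat{\lambda}_2 \le 4\xi\phi(f) + 4(1 + \xi)(1 - \mu_0^2).$$

Furthermore, for any $\gamma \in \bigl((1 - \mu_0^2)/[\phi(f) + 1 - \mu_0^2],\ 1\bigr)$, by restricting

$$\xi \le \frac{\gamma\phi(f) + (\gamma - 1)(1 - \mu_0^2)}{(1 + \gamma)\,[\phi(f) + 1 - \mu_0^2]} \tag{5.10}$$

we have

$$\|\beta^t - \hat{\beta}\|_2 \le \sqrt{\frac{1 - \alpha^2}{\alpha^2}} \cdot \gamma^t.$$

Now we turn to the statistical error. By Wedin's sin theorem, for some positive constant $C > 0$, we
have

$$\sin\angle(\beta^*, \hat{\beta}) \le C \cdot \frac{\xi\,\|\mathbb{E}(M)\|_2}{\lambda_1 - \lambda_2}. \tag{5.11}$$

Elementary calculation yields

$$\|\hat{\beta} - \beta^*\|_2 = 2\sin[\angle(\beta^*, \hat{\beta})/2] \le \sqrt{2}\,\sin\angle(\beta^*, \hat{\beta}). \tag{5.12}$$

As $\xi \lesssim \sqrt{p/n}$, combining (5.11) and (5.12), we have

$$\|\hat{\beta} - \beta^*\|_2 \lesssim \frac{\phi(f) + 1 - \mu_0^2}{\phi(f)} \cdot \sqrt{\frac{p}{n}}.$$

Putting all pieces together, we conclude that if $\xi$ satisfies (5.10) and $n \gtrsim p/\xi^2$, then we have that
with probability at least $1 - 2e^{-Cp}$,

$$\|\beta^t - \beta^*\|_2 \le C' \cdot \frac{\phi(f) + 1 - \mu_0^2}{\phi(f)} \cdot \sqrt{\frac{p}{n}} + \sqrt{\frac{1 - \alpha^2}{\alpha^2}} \cdot \gamma^t,$$

as required.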
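
The elementary calculation invoked in (5.12) can be written out as follows. The sketch writes $\omega$ for the angle between the two unit vectors and assumes that the sign of $\hat{\beta}$ is chosen so that $\omega \in [0, \pi/2]$.

```latex
\|\hat{\beta} - \beta^*\|_2^2 = 2 - 2\cos\omega = 4\sin^2(\omega/2)
\;\Longrightarrow\;
\|\hat{\beta} - \beta^*\|_2 = 2\sin(\omega/2),
\\[4pt]
\sin\omega = 2\sin(\omega/2)\cos(\omega/2) \ge \sqrt{2}\,\sin(\omega/2)
\quad\text{since } \cos(\omega/2) \ge \cos(\pi/4) = 1/\sqrt{2},
\\[4pt]
\text{hence}\quad 2\sin(\omega/2) \le \sqrt{2}\,\sin\omega .
```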


5.3 Proof of Theorem 3.7 


The analysis of Algorithm 2 follows from a combination of Vu et al. (2013) (for the initialization via
convex relaxation) and Yuan and Zhang (2013) (for the original truncated power method). Recall
that $\kappa$ is defined in (3.10). Assume the initialization $\beta^0$ is $s'$-sparse with $\|\beta^0\|_2 = 1$, and satisfies

$$\|\beta^0 - \beta^*\|_2 \le C \min\left\{\sqrt{\kappa(1 - \kappa^{1/2})/2},\ \sqrt{2\kappa}/4\right\}, \tag{5.13}$$

for $s' = C' \max\{\lceil 1/(\kappa^{-1/2} - 1)^2 \rceil, 1\} \cdot s$. Theorem 1 of Yuan and Zhang (2013) implies that

$$\|\beta^t - \beta^*\|_2 \le C'' \cdot \frac{[\phi(f) + (1 - \mu_0^2)]^{5/2}(1 - \mu_0^2)^{1/2}}{\phi(f)^3} \cdot \sqrt{\frac{s \log p}{n}} + \kappa^t \cdot \sqrt{\min\{(1 - \kappa^{1/2})/2,\ 1/8\}}$$

with high probability. Therefore, we only need to prove the initialization $\beta^0$ obtained in Algorithm
2 satisfies the condition in (5.13).

Corollary 3.3 of Vu et al. (2013) shows that the minimizer to the minimization problem in line
3 of Algorithm 2 satisfies

$$\|\Pi^0 - \beta^*(\beta^*)^\top\|_{\mathrm{F}} \le C \cdot \frac{\phi(f) + (1 - \mu_0^2)}{\phi(f)} \cdot s\sqrt{\frac{\log p}{n}}$$

with high probability. Corollary 3.2 of Vu et al. (2013) implies that the first eigenvector of $\Pi^0$, denoted
as $\bar{\beta}^0$, satisfies

$$\|\bar{\beta}^0 - \beta^*\|_2 \le C' \cdot \frac{\phi(f) + (1 - \mu_0^2)}{\phi(f)} \cdot s\sqrt{\frac{\log p}{n}}$$

with the same probability. However, $\bar{\beta}^0$ is not necessarily $s'$-sparse. Using Lemma 12 of Yuan and Zhang
(2013), we obtain that the truncation step in lines 5-8 of Algorithm 2 ensures that $\beta^0$ is $s'$-sparse
and also satisfies

$$\|\beta^0 - \beta^*\|_2 \le (1 + 2\sqrt{s/s'}) \cdot \|\bar{\beta}^0 - \beta^*\|_2 \le 3\,\|\bar{\beta}^0 - \beta^*\|_2,$$

where the last inequality follows from our assumption that $s' \ge s$. Therefore, we only have to set $n$
to be sufficiently large such that

$$\|\beta^0 - \beta^*\|_2 \le C' \cdot \frac{\phi(f) + (1 - \mu_0^2)}{\phi(f)} \cdot s\sqrt{\frac{\log p}{n}} \le C \min\left\{\sqrt{\kappa(1 - \kappa^{1/2})/2},\ \sqrt{2\kappa}/4\right\},$$

which is ensured by setting $n \ge n_{\min}$ with

$$n_{\min} = C' \cdot s^2 \log p \cdot \phi(f)^2 \cdot \min\{\kappa(1 - \kappa^{1/2})/2,\ \kappa/8\} / [(1 - \mu_0^2) + \phi(f)]^2,$$

as specified in our assumption. Thus we conclude the proof.


5.4 Proof of Theorem 3.8 

The proof of the minimax lower bound follows from the basic idea of reducing an estimation problem
to a testing problem, and then invoking Fano's inequality to lower bound the testing error. We first
introduce a finite packing set for $\mathbb{S}^{p-1} \cap \mathbb{B}_0(s,p)$.

Lemma 5.1. Consider the set $\{0,1\}^p$ equipped with Hamming distance $\delta$. For $s \le p/4$, there exists
a finite subset $\mathcal{Q} \subset \{0,1\}^p$ such that

$$\delta(\theta, \theta') > s/2,\ \ \forall (\theta, \theta') \in \mathcal{Q} \times \mathcal{Q} \text{ and } \theta \ne \theta', \qquad \|\theta\|_0 = s, \text{ for all } \theta \in \mathcal{Q}.$$

The cardinality of such a set satisfies

$$\log(|\mathcal{Q}|) \ge 8/3 \cdot s \log(p/s).$$

Proof. See the proof of Lemma 4.10 in Massart and Picard (2007). □
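We use $\mathcal{Q}(p, s)$ to denote the finite set specified in Lemma 5.1. For $\xi < 1$, we construct a finite
subset $\mathcal{Q}(p, s, \xi) \subset \mathbb{S}^{p-1} \cap \mathbb{B}_0(s,p)$ as

$$\mathcal{Q}(p, s, \xi) := \left\{\beta \in \mathbb{R}^p : \beta = \left(\sqrt{1 - \xi^2},\ \frac{\xi}{\sqrt{s - 1}}\, w^\top\right)^\top, \text{ where } w \in \mathcal{Q}(p - 1, s - 1)\right\}. \tag{5.14}$$

It is easy to verify that the set $\mathcal{Q}(p, s, \xi)$ has the following properties:

• For any $\theta \in \mathcal{Q}(p, s, \xi)$, it holds that $\|\theta\|_2 = 1$ and $\|\theta\|_0 = s$.

• For distinct $\theta, \theta' \in \mathcal{Q}(p, s, \xi)$, $\|\theta - \theta'\|_2 > \sqrt{2}\xi/2$ and $\|\theta - \theta'\|_2 \le \sqrt{2}\xi$.

• $\log |\mathcal{Q}(p, s, \xi)| \ge C s \log(p/s)$ for some positive constant $C$.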
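
The first two properties can be checked on a small example. The sketch assumes that each element of the packing set has the form $\beta = (\sqrt{1 - \xi^2},\ \xi w/\sqrt{s-1})$ with $w$ a binary vector carrying $s - 1$ ones, which is the form consistent with the unit-norm and sparsity properties listed above. The function names and the two vectors `w1`, `w2` are hypothetical.

```python
import math


def embed(w, xi):
    """Map a binary vector w with s - 1 ones to a unit vector with s nonzeros."""
    s_minus_1 = sum(w)
    scale = xi / math.sqrt(s_minus_1)
    return [math.sqrt(1.0 - xi * xi)] + [scale * wj for wj in w]


def l2_dist(a, b):
    """Euclidean distance between two equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two binary vectors with s - 1 = 4 ones and Hamming distance 4 > (s - 1)/2.
w1 = [1, 1, 1, 1, 0, 0, 0, 0, 0]
w2 = [1, 1, 0, 0, 1, 1, 0, 0, 0]
xi = 0.5
b1, b2 = embed(w1, xi), embed(w2, xi)
dist = l2_dist(b1, b2)   # equals xi * sqrt(Hamming distance / (s - 1))
```

Since the Hamming distance between two such vectors lies in $((s-1)/2,\ 2(s-1)]$, the distance `dist` always falls in $(\sqrt{2}\xi/2,\ \sqrt{2}\xi]$.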

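In order to derive a lower bound of $\mathcal{R}(n, m, L, \mathcal{B})$ with $\mathcal{B} = \mathbb{S}^{p-1} \cap \mathbb{B}_0(s,p)$, we assume that the
infimum over $f$ in (3.14) is obtained for a certain $f^* \in \mathcal{F}(m, L)$, namely

$$\mathcal{R}(n, m, L, \mathcal{B}) = \inf_{\hat{\beta} \in \mathbb{S}^{p-1}}\ \sup_{\beta \in \mathbb{S}^{p-1} \cap \mathbb{B}_0(s,p)} \mathbb{E}\,\|\hat{\beta}(\mathcal{X}^n_{f^*}) - \beta\|_2 \ge \inf_{\hat{\beta} \in \mathbb{S}^{p-1}}\ \sup_{\beta \in \mathcal{Q}(p,s,\xi)} \mathbb{E}\,\|\hat{\beta}(\mathcal{X}^n_{f^*}) - \beta\|_2.$$

Note that for any $\xi > 0$, we have $\|\beta_1 - \beta_2\|_2 > \sqrt{2}\xi/2$ for any two distinct vectors $(\beta_1, \beta_2)$ in $\mathcal{Q}(p, s, \xi)$.
Therefore, we are in a position to apply the standard minimax risk lower bound. Following Lemma 3
in Yu (1997), we obtain

$$\inf_{\hat{\beta} \in \mathbb{S}^{p-1}}\ \sup_{\beta \in \mathcal{Q}(p,s,\xi)} \mathbb{E}\,\|\hat{\beta}(\mathcal{X}^n_{f^*}) - \beta\|_2 \ge \frac{\sqrt{2}\,\xi}{4}\left(1 - \frac{\max_{\beta, \beta' \in \mathcal{Q}(p,s,\xi)} D_{\mathrm{KL}}(P_{\beta'} \| P_{\beta}) + \log 2}{\log|\mathcal{Q}(p,s,\xi)|}\right). \tag{5.15}$$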

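In the following, we derive an upper bound for the term involving KL divergence on the right hand
side of the above inequality. For any $\beta, \beta' \in \mathcal{Q}(p, s, \xi)$, we have

$$\begin{aligned}
D_{\mathrm{KL}}(P_{\beta'} \| P_{\beta}) &\le n \cdot D_{\mathrm{KL}}[P_{\beta'}(Y, X) \| P_{\beta}(Y, X)] = n \cdot \mathbb{E}_X\{D_{\mathrm{KL}}[P_{\beta'}(Y|X) \| P_{\beta}(Y|X)]\} \\
&= \frac{n}{2} \cdot \mathbb{E}_X\left\{[1 + f^*(X^\top\beta')] \log\frac{1 + f^*(X^\top\beta')}{1 + f^*(X^\top\beta)} + [1 - f^*(X^\top\beta')] \log\frac{1 - f^*(X^\top\beta')}{1 - f^*(X^\top\beta)}\right\} \\
&\le \frac{n}{2} \cdot \mathbb{E}_X\left\{[1 + f^*(X^\top\beta')]\left[\frac{1 + f^*(X^\top\beta')}{1 + f^*(X^\top\beta)} - 1\right] + [1 - f^*(X^\top\beta')]\left[\frac{1 - f^*(X^\top\beta')}{1 - f^*(X^\top\beta)} - 1\right]\right\}.
\end{aligned} \tag{5.16}$$

In the last inequality, we utilize the fact that $\log z \le z - 1$. Then by elementary calculation, we
have

$$D_{\mathrm{KL}}(P_{\beta'} \| P_{\beta}) \le n \cdot \mathbb{E}_X\left\{\frac{[f^*(X^\top\beta) - f^*(X^\top\beta')]^2}{[1 + f^*(X^\top\beta)] \cdot [1 - f^*(X^\top\beta)]}\right\}. \tag{5.17}$$

Using $|f(z)| \le 1 - m$ and the Lipschitz continuity condition of $f$, we have

$$D_{\mathrm{KL}}(P_{\beta'} \| P_{\beta}) \le n \cdot \mathbb{E}_X\left[\frac{L^2 \langle X, \beta - \beta' \rangle^2}{m(1 - m)}\right] = \frac{nL^2\,\|\beta - \beta'\|_2^2}{m(1 - m)} \le \frac{2nL^2\xi^2}{m(1 - m)}. \tag{5.18}$$

Note that (5.16)-(5.18) hold for any $\beta, \beta' \in \mathcal{Q}(p, s, \xi)$. We thus have

$$\max_{\beta, \beta' \in \mathcal{Q}(p,s,\xi)} D_{\mathrm{KL}}(P_{\beta'} \| P_{\beta}) \le \frac{2nL^2\xi^2}{m(1 - m)}.$$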
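
The per-sample inequality behind this bound can be checked numerically. For two laws on $\{-1, 1\}$ with $\mathbb{P}(Y = 1)$ equal to $(1 + a)/2$ and $(1 + b)/2$, the bound $\log z \le z - 1$ gives a KL divergence of at most $(a - b)^2/[(1 + b)(1 - b)]$, and $|b| \le 1 - m$ further gives $(1 + b)(1 - b) \ge m(1 - m)$. The sketch below verifies both steps on a grid. The grid, the value of `m`, and the function names are choices made for this illustration.

```python
import math


def kl_pm1(a, b):
    """KL divergence between laws on {-1, 1} with P(Y=1) = (1+a)/2 and (1+b)/2."""
    return 0.5 * ((1 + a) * math.log((1 + a) / (1 + b))
                  + (1 - a) * math.log((1 - a) / (1 - b)))


def chi_square_bound(a, b):
    """Upper bound (a - b)^2 / [(1 + b)(1 - b)] obtained from log z <= z - 1."""
    return (a - b) ** 2 / ((1 + b) * (1 - b))


m = 0.1
grid = [-(1 - m) + k * 2 * (1 - m) / 40 for k in range(41)]
worst_gap = min(chi_square_bound(a, b) - kl_pm1(a, b) for a in grid for b in grid)
margin_ok = all(chi_square_bound(a, b) <= (a - b) ** 2 / (m * (1 - m)) + 1e-12
                for a in grid for b in grid)
```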

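Now we proceed with (5.15) using the above result. The right hand side is thus lower bounded by

$$\frac{\sqrt{2}\,\xi}{4}\left(1 - \frac{2L^2 n\xi^2/[m(1 - m)] + \log 2}{\log|\mathcal{Q}(p,s,\xi)|}\right) \ge \frac{\sqrt{2}\,\xi}{4}\left(1 - \frac{2L^2 n\xi^2/[m(1 - m)] + \log 2}{C s \log(p/s)}\right),$$

where the last inequality is from $\log|\mathcal{Q}(p,s,\xi)| \ge C s \log(p/s)$. Finally, consider the case where the
sample size $n$ is sufficiently large such that

$$n > \frac{m(1 - m)}{2L^2} \cdot [C s \log(p/s)/2 - \log 2].$$

By choosing

$$\xi^2 = \frac{m(1 - m)}{2L^2 n} \cdot [C s \log(p/s)/2 - \log 2], \tag{5.19}$$

which satisfies $\xi < 1$ under the above condition on $n$, the numerator $2L^2 n\xi^2/[m(1 - m)] + \log 2$ equals
$C s \log(p/s)/2$, so the factor in parentheses equals $1/2$. We thus have

$$\mathcal{R}(n, m, L, \mathcal{B}) \ge C' \cdot \frac{\sqrt{m(1 - m)}}{L} \cdot \sqrt{\frac{s \log(p/s)}{n}},$$

as required.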